Introduction to Open Data Science - Course Project

About the project

I have learned about the course Introduction to Open Data Science through my PhD studies. I’m exited to start the course but also a bit worried about the workload. The course Moodle page seems to contain a lot of information, so probably will take a while to get a handle on things.

I have previously used R and Rstudio on some courses, but I’m no expert. I think that exercise 1 was too long, and all in all there was too much stuff to be learned about R in the first week. To me, this one really long exercise feels exhausting. It is nice that there are the R codes already embedded, but I feel smaller chunks would be better. If I hadn’t done R before, I would be terrified right now.

Link to my GitHub repository


Regression and model validation

date()
## [1] "Sat Nov 26 09:40:37 2022"

Basic information about dataset

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5      v purrr   0.3.4 
## v tibble  3.1.2      v dplyr   1.0.10
## v tidyr   1.1.3      v stringr 1.4.0 
## v readr   1.4.0      v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
students2014 <- read_csv("./data/learning2014.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )
dim(students2014)
## [1] 166   7
str(students2014)
## spc_tbl_ [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )

Dataset is created from data: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
Data is collected from students in course ‘Johdatus yhteiskuntatilastotieteeseen’ in fall 2014.

Dataset students2014 contains 7 variables with 166 observations: gender, age, attitude, deep, stra, surf, and points.

All variables in the original data related to strategic, deep, and surface learning have been combined into three variables (stra, deep, and surf, respectively), and scaled to original scales by taking the mean.

Attitude tells about student’s global attitude towards statistics. Variable has been scaled back to the original scale of the questions, by dividing it with the number of questions.

Points tells the course exam points. Observations with points equaling 0 have been excluded, since those students didn’t attend the course exam.

Describing the data

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(factor(students2014$gender))
##   F   M 
## 110  56
summary(students2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(students2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000
summary(students2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(students2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(students2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(students2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

There are 66% (110/166) women in respondents. The median age is 22 years (range 17-55), the median age of women seems to be a bit lower than that of men.

The median attitude is 3.2 (range 1.4-5). The median attitude of men seems higher than that of women.

The median for deep learning is 3.7 (range 1.6-4.9). The median for strategic learning is 3.1 (range 1.3-5). The median for surface learning is 2.8 (range 1.6-4.3). For women, the median for strategic and surface learning seems a bit higher than for men.

The median for exam points is 23 (range 7-33). The median is about the same in men and women.

Attitude correlates with exam points, so that with “better” attitude towards statistics one usually gets higher exam points.

Surface learning correlates negatively with deep and strategic learning as well as attitude. So with better attitude one less often utilizes surface learning. If one utilizes surface learning, one less often utilizes deep or strategic learning.

Interpreting the fitted regression model

#fit a linear regression model using points as the outcome and attitude, strategig and surface learning as explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#remove surface learning from the model, since it's so far from significant
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
#error rate
sigma(my_model2)*100/mean(students2014$points)
## [1] 23.28148

First we choose three variables that correlate the most with points in above visualization (namely attitude, strategic and surface learning). However, after fitting the model, we see that for surface learning there is a really high p-value (0.466), so we remove that and fit the model again with just attitude and strategic learning.

In the new model, p-value of the F-statistic is 7.734e-09. So at least one of the explanatory variables is significantly associated with exam points.

For attitude, p-value is <0.001, so it is significantly associated with exam points. According to this model, if attitude increases by 1, exam points increase by approximately 3.5.

For strategic learning, p-value is 0.089. It isn’t <0.05 that’s usually deemed significant, however, it is quite close. So we probably don’t want to state that strategic learning has no effect on exam points. On the other hand, the estimated intercept for strategic learning is 0.9, so a change in attitude makes a much bigger difference than a change in strategic learning.

The adjusted R-squared is 0.1951, which means that attitude (and strategic learning) only explain around 20% of the variance in exam points.

The residual standard error is 5.289 which corresponds to 23% prediction error rate.

Assumptions of the model

plot(my_model2, which=c(1,2,5))

Linear regression assumes a linear relationship between predictors and outcome, that residual errors are normally distributed, that residuals have a constant variance, and independence of residual error terms.

With Residuals vs. Fitted plot we can check the linearity of data. Here, the red line seems to be horizontal at approximately zero, where it should be. So we can assume a linear relationship.

With plot Normal Q-Q we can check if residual errors are normally distributed. The plot of residuals should more or less follow the dashed line. Here they mostly do so, with some exceptions at the upper and lower ends. However, we may assume normality.

The Residuals vs. Leverage plot marks three most extreme points (35, 71, 145). Two of them exceed 3 standard deviations, so they are possible outliers.


Logistic regression

date()
## [1] "Sat Nov 26 09:40:48 2022"

Short description of dataset

library(tidyverse)
alc <- read_csv("./data/alc.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   age = col_double(),
##   Medu = col_double(),
##   Fedu = col_double(),
##   traveltime = col_double(),
##   studytime = col_double(),
##   famrel = col_double(),
##   freetime = col_double(),
##   goout = col_double(),
##   Dalc = col_double(),
##   Walc = col_double(),
##   health = col_double(),
##   failures = col_double(),
##   absences = col_double(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double(),
##   alc_use = col_double(),
##   high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Dataset is created from data: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The data approaches student achievement in secondary education of two Portuguese schools, and the dataset joins two student alcohol consumption datasets. The variables not used for joining the two data have been combined by averaging (including the grade variables). ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’. ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.

Dataset alc contains 35 variables with 370 observations.

My personal hypothesis about relationships of with alcohol consumption

I think interesting variables in relation to alcohol consumption are sex, number of past class failures, final grade (G3) and number of school absences.
My hypothesis for these variables are:
1. Men’s alcohol consumption is higher than women’s.
2. Those who consume more alcohol have more past class failures compared to those who consume less alcohol.
3. Those who consume more alcohol have lower final grades compared to those who consume less alcohol.
4. Those who consume more alcohol have more school absences compared to those who consume less alcohol.

Description of chosen variables and their relationships with alcohol consumption

library(dplyr); library(ggplot2); library(GGally); library(finalfit)


dependent <- "high_use"
explanatory <- c("sex", "failures", "G3", "absences")
alc %>% 
  summary_factorlist(dependent, explanatory, p = TRUE,
                     add_dependent_label = TRUE)
## Warning in chisq.test(failures, high_use): Chi-squared approximation may be
## incorrect
##  Dependent: high_use                FALSE       TRUE      p
##                  sex         F 154 (59.5)  41 (36.9) <0.001
##                              M 105 (40.5)  70 (63.1)       
##             failures         0 238 (91.9)  87 (78.4)  0.003
##                              1   12 (4.6)  12 (10.8)       
##                              2    8 (3.1)    9 (8.1)       
##                              3    1 (0.4)    3 (2.7)       
##                   G3 Mean (SD) 11.8 (3.4) 10.9 (3.0)  0.011
##             absences Mean (SD)  3.7 (4.5)  6.4 (7.1) <0.001
#median and quartiles of the numeric variables
summary(alc$failures)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1892  0.0000  3.0000
summary(alc$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.52   14.00   18.00
summary(alc$absences)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   4.511   6.000  45.000
#draw a barplot of alcohol consumption by sex
ggplot(data = alc, aes(x = high_use, fill=sex)) + geom_bar()

#produce summary statistics of final grade by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_grade=mean(G3),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_grade count
##   <chr> <lgl>         <dbl> <int>
## 1 F     FALSE          11.4   154
## 2 F     TRUE           11.8    41
## 3 M     FALSE          12.3   105
## 4 M     TRUE           10.3    70
#draw a boxplot of final grade by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade")

#produce summary statistics of school abscences by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_absences=mean(absences),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_absences count
##   <chr> <lgl>            <dbl> <int>
## 1 F     FALSE             4.25   154
## 2 F     TRUE              6.85    41
## 3 M     FALSE             2.91   105
## 4 M     TRUE              6.1     70
#draw a boxplot of school absences by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

There are 195 women and 175 men. Of those, 21% (41/195) and 40% (70/175) have high alcohol consumption, respectively. This supports my hypothesis of men consuming more alcohol than women.

The mean of past class failures is 0.2 (range 0-3). Of high users, 3% (3/111) have 3 failures, 8% (9/111) have 2 failures, and 11% have 1 failure. Of others, 0.4% (1/259) have 3 failures, 3% (8/259) have 2 failures, and 5% have 1 failure. Thus it would seem that students with high alcohol consumption have failed class more often than those with low alcohol consumption.

The mean final grade is 11.5 (range 0-18). Men with high alcohol consumption seem to have lower final grades than people with low alcohol consumption. Interestingly, women with high alcohol consumption seem to have around the same final grades than people with low alcohol consumption. So my hypothesis is supported only by observations of men.

The mean number of absences is 4.5 (range 0-45, median 3). The median of school absences is higher for men with high alcohol consumption than other groups. However, if we look at the mean, women with high alcohol consumption have the highest mean of absences, and both men and women with high alcohol consumption have higher mean than those with low alcohol consumption. This is somewhat in line with my hypothesis.

Logistic regression

# fit a logistic regression model 
my_model <- glm(high_use ~ sex + failures + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1561  -0.8429  -0.5872   1.0033   2.1393  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.38733    0.51617  -2.688  0.00719 ** 
## sexM         1.00870    0.24798   4.068 4.75e-05 ***
## failures     0.50382    0.22018   2.288  0.02213 *  
## absences     0.09058    0.02322   3.901 9.56e-05 ***
## G3          -0.04671    0.03948  -1.183  0.23671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 405.59  on 365  degrees of freedom
## AIC: 415.59
## 
## Number of Fisher Scoring iterations: 4
# fit a new model without G3 model 
my_model2 <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model2)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1550  -0.8430  -0.5889   1.0328   2.0374  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.94150    0.23129  -8.394  < 2e-16 ***
## sexM         0.99731    0.24725   4.034 5.49e-05 ***
## failures     0.59759    0.20698   2.887  0.00389 ** 
## absences     0.09245    0.02323   3.979 6.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.99  on 366  degrees of freedom
## AIC: 414.99
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(my_model2) %>% exp

# compute confidence intervals (CI)
CI <- confint(my_model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## sexM        2.7109881 1.67967829 4.4367282
## failures    1.8177381 1.21997630 2.7642305
## absences    1.0968598 1.05041937 1.1508026

In the first model, final grade has a p-value 0.2, and thus isn’t associated with high alcohol consumption. Let’s fit the model again without it. In the second model, all variables are associated with the outcome, and also the AIC drops from 415.59 to 414.99, which indicates that G3 can be dropped from the model.

According to this model, high alcohol consumption is associated with being male (OR 2.7, 95% CI 1.7-4.4, p-value <0.001), having failed class (OR 1.8, 95% CI 1.2-2.8, p-value 0.004), and school absences (OR 1.1, 95% CI 1.1-1.2, p-value <0.001).
My primary hypothesis were correct regarding being male and having failed classes. However, there is only slight association with school absence (OR 1.1), and the final grade doesn’t seem to be associated with high alcohol consumption.

Predictive power of the model

# predict() the probability of high_use
probabilities <- predict(my_model2, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# draw a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use)) + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

In the 2x2 table we can see that 252 low and 33 high alcohol consumers are predicted correctly. In the predictions, however, there are 78 false negatives and 7 false positives. According to the training error, 23% of the individuals are classified inaccurately. This is better than tossing a coin (where the chances would be 50/50), but I guess with some background information one could get quite close with educated guesses.
## 10-fold cross-validation of the model

# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
#Let's try the same with another model. 
my_model3 <- glm(high_use ~ traveltime + sex + absences +studytime + goout, data = alc, family = "binomial")
summary(my_model3)
## 
## Call:
## glm(formula = high_use ~ traveltime + sex + absences + studytime + 
##     goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8435  -0.7908  -0.4914   0.6936   2.5211  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.72342    0.67103  -5.549 2.88e-08 ***
## traveltime   0.35147    0.18157   1.936 0.052908 .  
## sexM         0.81643    0.27225   2.999 0.002710 ** 
## absences     0.07919    0.02293   3.454 0.000552 ***
## studytime   -0.40974    0.17327  -2.365 0.018041 *  
## goout        0.71830    0.12152   5.911 3.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 363.51  on 364  degrees of freedom
## AIC: 375.51
## 
## Number of Fisher Scoring iterations: 4
prob2 <- predict(my_model3, type = "response")
loss_func(class = alc$high_use, prob = prob2)
## [1] 0.2027027
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = my_model3, K = 10)
cv2$delta[1]
## [1] 0.2108108

The prediction error of the model when using 10-fold cross-validation is around 0.24, which is slightly less than in the exercise set.
With the other tested model (explanatory variables being sex, home to school travel time, number of school absences, weekly study time, and going out with friends), the prediction error with 10-fold cross-validation is around 0.22, which is still lower.


Clustering and classification

date()
## [1] "Sat Nov 26 09:40:53 2022"

Short description of dataset

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data 
data("Boston")

# check structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset contains housing values in suburbs of Boston. It can be downloaded from R’s MASS package, and contains 506 observations of 14 variables. More information on the data and variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Graphical overview of the data and summaries of the variables

library(tidyr)
# check summary of data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# draw pairs plot of the variables
pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

The summary of the data shows means, medians and ranges of the different variables, for example median number of rooms per dwelling is 6.2 (range 3.6-8.8).

Tax rate and accessibility to radial highways seem to be strongly positively correlated. Median value and lower status of the population are negatively correlated. The plotted correlation matrix shows also other positive and negative correlations. The darker blue a sphere is, the stronger the positive correlation between variables, and the darker red a sphere is, the stronger the negative correlation.

Standardizing the dataset

library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

boston_scaled$crim <- as.numeric(boston_scaled$crim)

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, label = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

The variables are now transformed on the same scale, so now we can compare them.

Linear discriminant analysis and predictions

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2623762 0.2475248 0.2500000 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       1.0623042 -0.9737973 -0.15056308 -0.8995018  0.4601568 -0.8819699
## med_low  -0.1055932 -0.2299906 -0.01233188 -0.5406381 -0.1296533 -0.3484640
## med_high -0.4006284  0.2382999  0.23949396  0.4386949  0.0837805  0.4276585
## high     -0.4872402  1.0171306 -0.03844192  1.0355576 -0.3292932  0.8111877
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9485473 -0.6965304 -0.7471819 -0.49327095  0.37446906 -0.76119547
## med_low   0.3266006 -0.5582386 -0.4699227 -0.07576364  0.31999182 -0.12890466
## med_high -0.4189792 -0.4030438 -0.2802181 -0.24828966  0.07994062  0.01203417
## high     -0.8541998  1.6379981  1.5139626  0.78062517 -0.75845061  0.88700601
##                  medv
## low       0.527191704
## med_low   0.007100809
## med_high  0.145284175
## high     -0.663589680
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09274900  0.713399265 -0.87881825
## indus    0.07315641 -0.340287926  0.46396636
## chas    -0.10129945 -0.069721138  0.09819383
## nox      0.37148040 -0.632326731 -1.44874037
## rm      -0.07867775 -0.050032220 -0.17057754
## age      0.20779845 -0.198451832 -0.23755017
## dis     -0.09273396 -0.136042584  0.11668886
## rad      3.18099891  0.864673450  0.08899972
## tax     -0.01732131  0.059112429  0.39569648
## ptratio  0.10182160  0.014060525 -0.29473522
## black   -0.10318516 -0.006286222  0.10384031
## lstat    0.17357363 -0.120483235  0.31729751
## medv     0.15905225 -0.274484908 -0.20223647
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9448 0.0405 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      16        0    0
##   med_low    1      17        2    0
##   med_high   0      10       15    1
##   high       0       0        0   26

It would seem that with high crime rate the predictions are most accurate. With low to medium and medium to high there is most inaccuracy.

Distance measures and k-means clustering

library(ggplot2)
data(Boston)
boston_scaled2 <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 3)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

From plotting the total within sum of squares we see that around two the value changes quite a lot, so the appropriate number of cluster would be two. The data seems to divide nicely between two clusters according to most variables.

Bonus

data(Boston)
boston_scaled3 <- as.data.frame(scale(Boston))

# k-means clustering
set.seed(123)
km2 <- kmeans(boston_scaled3, centers = 3)

# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~., data = boston_scaled3)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 4)

Variables age, black, and tax seem to be the most influential linear separators for the clusters.

Super-Bonus

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
lda.fit3 <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit3$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit3$scaling
matrix_product <- as.data.frame(matrix_product)

classes <- as.numeric(train$crime)
train2 <- dplyr::select(train, -crime)
set.seed(123)
km3 <- kmeans(train2, centers = 4)
clusters <- as.numeric(km3$cluster)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=clusters)

The cluster that is on the left-hand side seems quite similar in both plots and is quite well defined. In the second plot the clusters are more defined, in the first plot the rest of the clusters mix more with each other.


(more chapters to be added similarly as we proceed with the course!)